Dynamic Programming

Dynamic: Sequential or temporal component of the problem.
Programming: Optimizing a "program" or a sequence of decisions.

Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure.

Markov decision processes satisfy both properties. The Bellman equation gives the recursive decomposition: it defines the value of a state as the immediate reward plus the discounted value of the successor state. The value function stores and reuses the solutions to these subproblems.
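In standard notation (a policy π, transition probabilities P, rewards R, discount γ; these symbols are not defined elsewhere in these notes), this decomposition is the Bellman expectation equation:

V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_\pi(s') \right]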


Policy Evaluation

Problem: Given a policy π, compute the value function V(s) for all states s ∈ S. Solution: Iterative application of the Bellman expectation equation.
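Concretely, each iteration sweeps over all states and applies the Bellman expectation backup, with k indexing iterations:

V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]

As k → ∞, V_k converges to V_π.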

Policy Evaluation Algorithm in the Small Gridworld Example (uniform random policy, reward -1 on every transition, undiscounted; the two blank corner cells are terminal):

\begin{array}{|c|c|c|c|} \hline & 1 & 2 & 3 \\ \hline 4 & 5 & 6 & 7 \\ \hline 8 & 9 & 10 & 11 \\ \hline 12 & 13 & 14 & \\ \hline \end{array}

Iteration 0 ->

\begin{array}{|c|c|c|c|} \hline  0  &  0  &  0  &  0  \\ \hline  0  &  0  &  0  &  0  \\ \hline  0  &  0  &  0  &  0  \\ \hline  0  &  0  &  0  &  0  \\ \hline \end{array}

Iteration 1 ->

\begin{array}{|c|c|c|c|} \hline 0.00 & -1.00 & -1.00 & -1.00 \\ \hline -1.00 & -1.00 & -1.00 & -1.00 \\ \hline -1.00 & -1.00 & -1.00 & -1.00 \\ \hline -1.00 & -1.00 & -1.00 & 0.00 \\ \hline \end{array}

Iteration 2 ->

\begin{array}{|c|c|c|c|} \hline 0 & -1.75 & -2 & -2 \\ \hline -1.75 & -2 & -2 & -2 \\ \hline -2 & -2 & -2 & -1.75 \\ \hline -2 & -2 & -1.75 & 0 \\ \hline \end{array}

Iteration 3 ->

\begin{array}{|c|c|c|c|} \hline 0 & -2.44 & -2.75 & -2.75 \\ \hline -2.44 & -2.75 & -2.75 & -2.44 \\ \hline -2.75 & -2.75 & -2.44 & -2 \\ \hline -2.75 & -2.44 & -2 & 0 \\ \hline \end{array}

Iteration 100 ->

\begin{array}{|c|c|c|c|} \hline 0 & -14.0 & -20.0 & -22.0 \\ \hline -14.0 & -18.0 & -20.0 & -20.0 \\ \hline -20.0 & -20.0 & -18.0 & -14.0 \\ \hline -22.0 & -20.0 & -14.0 & 0 \\ \hline \end{array}
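The sweeps above can be reproduced with a short script. This is a minimal sketch assuming the standard setup implied by the tables (states numbered 0-15 row by row, corner states 0 and 15 terminal, uniform random policy, reward -1 per move, γ = 1, moves off the grid leave the state unchanged):

```python
import numpy as np

# 4x4 gridworld: states 0..15 row by row; 0 and 15 are terminal.
N = 4
TERMINAL = {0, 15}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, m):
    """Apply move m in state s; moves off the grid leave the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + MOVES[m][0], c + MOVES[m][1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

V = np.zeros(N * N)
while True:
    # One sweep of the Bellman expectation backup under the uniform
    # random policy: each move has probability 0.25 and reward -1.
    V_new = np.array([
        0.0 if s in TERMINAL else
        sum(0.25 * (-1 + V[step(s, m)]) for m in range(4))
        for s in range(N * N)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print(np.round(V.reshape(N, N), 1))  # matches the iteration-100 table
```

Running it reproduces the converged table above: -14, -18, -20, -22 in the pattern shown.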

Policy Iteration

Problem: Given an MDP, find the optimal policy π*.
Solution: Iterative application of policy evaluation and policy improvement.

Given a policy π, first evaluate it to obtain V_π, then improve it by acting greedily with respect to V_π: π' = greedy(V_π). Repeating evaluation and improvement converges to the optimal value function V* and an optimal policy π*.
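The two alternating steps can be sketched on the same 4x4 gridworld as above. This is a minimal illustration, not a general implementation: starting from the uniform random policy, it alternates evaluation with greedy one-step-lookahead improvement until the policy stops changing.

```python
import numpy as np

# Same gridworld as above: states 0..15, corners 0 and 15 terminal,
# reward -1 per move, gamma = 1.
N = 4
TERMINAL = {0, 15}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, m):
    """Apply move m in state s; moves off the grid leave the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + MOVES[m][0], c + MOVES[m][1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

def evaluate(policy, tol=1e-9):
    """Iterative policy evaluation; policy[s] is a length-4 probability vector."""
    V = np.zeros(N * N)
    while True:
        V_new = np.array([
            0.0 if s in TERMINAL else
            sum(policy[s][m] * (-1 + V[step(s, m)]) for m in range(4))
            for s in range(N * N)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def greedy(V):
    """Deterministic (one-hot) policy from a one-step lookahead on V."""
    policy = np.zeros((N * N, 4))
    for s in range(N * N):
        best = int(np.argmax([-1 + V[step(s, m)] for m in range(4)]))
        policy[s][best] = 1.0
    return policy

policy = np.full((N * N, 4), 0.25)  # start from the uniform random policy
while True:
    V = evaluate(policy)            # policy evaluation
    improved = greedy(V)            # policy improvement
    if np.array_equal(improved, policy):
        break                       # policy stable -> optimal
    policy = improved

print(V.reshape(N, N))  # optimal values: minus the steps to the nearest terminal
```

Note that in this gridworld a single greedy improvement on the random-policy values already yields an optimal policy; the loop then terminates once the policy is stable.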


#MMI706 - Reinforcement Learning at METU